Last updated: 2023-08-01
Checks: 5 1
Knit directory: WenjunLiu_Thesis_Chapter4/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200930) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Tracking code development and connecting the code version to the
results is critical for reproducibility. To start using Git, open the
Terminal and type git init in your project directory.
This project is not being versioned with Git. To obtain the full
reproducibility benefits of using workflowr, please see
?wflow_start.
library(tidyverse)
library(yaml)
library(scales)
library(pander)
library(glue)
library(edgeR)
library(AnnotationHub)
library(ensembldb)
library(cowplot)
library(ggfortify)
library(magrittr)
library(cqn)
library(ggrepel)
library(DT)
# library(variancePartition)
library(BiocParallel)
library(UpSetR)
library(corrplot)
library(singscore)
library(msigdbr)
library(broom)
library(goseq)
library(WGCNA)
library(reactable)
library(ggforce)
panderOptions("table.split.table", Inf)
panderOptions("big.mark", ",")
theme_set(theme_bw())
config <- here::here("config/config.yml") %>%
read_yaml()
suffix <- paste0(config$tag)
sp <- config$ref$species %>%
str_replace("(^[a-z])[a-z]*_([a-z]+)", "\\1\\2") %>%
str_to_title()
ah <- AnnotationHub() %>%
subset(rdataclass == "EnsDb") %>%
subset(str_detect(description, as.character(config$ref$release))) %>%
subset(genome == config$ref$build)
stopifnot(length(ah) == 1)
ensDb <- ah[[1]]
genesGR <- read_rds(here::here("output/genesGR.rds"))
Gene annotations were again loaded from Ensembl Release 101. The previously defined
GenomicRanges object containing GC content and Gene Length
was also loaded, containing information for 67,990 genes.
Dominant cell types (stroma, epithelial, ducts or fat) at the time of tissue collection were recorded and this was read in.
init_cellType <- read_delim(here::here("config/sample_meta.txt"), delim = "\t")
init_cellType <- init_cellType %>%
mutate(Stroma = ifelse(str_detect(Dominant_cell_type, regex("Stroma", ignore_case = T)), TRUE, FALSE),
Epithelial = ifelse(str_detect(Dominant_cell_type, regex("Epithelial", ignore_case = T)), TRUE, FALSE),
Ducts = ifelse(str_detect(Dominant_cell_type, regex("ducts", ignore_case = T)), TRUE, FALSE),
Fat = ifelse(str_detect(Dominant_cell_type, regex("fat", ignore_case = T)), TRUE, FALSE),
patient = str_replace(patient, "TH", "TH-")) %>%
pivot_longer(c("Stroma", "Epithelial", "Ducts", "Fat"),
names_to = "cell_type",
values_to = "TF") %>%
mutate(cell_type = ifelse(TF, cell_type, NA)) %>%
dplyr::select(-c("TF", "Dominant_cell_type")) %>%
.[!is.na(.$cell_type),] %>%
chop("cell_type") %>%
mutate(cell_type = vapply(.$cell_type, function(x){
paste(x,collapse = ";")
}, character(1)))
samples <- config$samples %>%
here::here() %>%
read_tsv() %>%
left_join(init_cellType) %>%
mutate(
Filename = paste0(sample, ".r_1"),
condition = ifelse(Tumor,
paste("Tumor", treat, sep = "_"),
paste("Normal", treat, sep = "_")),
patient = vapply(.$patient, function(x){str_split(x, "-")[[1]][2]}, character(1)),
patient = ifelse(Tumor,
paste("Tumor", patient, sep = "-"),
paste("Normal",patient, sep = "-")),
desc = paste(patient, treat, sep = " ")
) %>%
dplyr::select(-c("name", "sample")) %>%
dplyr::rename(name = desc) %>%
mutate_if(
function(x){length(unique(x)) < length(x)},
as.factor
) %>%
mutate(
treat = relevel(treat, ref = "Veh")
)
Sample metadata was split to a normal and a primary malignant tumour group.
mergedSamples <- samples %>%
group_by(name, patient, treat, Tumor, cell_type, Tissue_type, Age, Diagnosis) %>%
tally()
normal_sample <- mergedSamples %>%
dplyr::filter(Tumor == FALSE) %>%
droplevels()
tumor_sample <- mergedSamples %>%
dplyr::filter(Tumor == TRUE) %>%
droplevels()
While we have tumours collected from 8 patients and 4 treatment arms, one DHT-treated sample was lost (i.e Tumor -8). For all downstream analyses, we will focus on the 31 ER-positive primary malignant tumours.
pander(
mergedSamples %>%
dplyr::filter(Tumor == TRUE),
caption = "*Summary of sequencing total runs for all tumour samples*"
)
| name | patient | treat | Tumor | cell_type | Tissue_type | Age | Diagnosis | n |
|---|---|---|---|---|---|---|---|---|
| Tumor-1 DHT | Tumor-1 | DHT | TRUE | Epithelial | Malignant | 41 | invasive carcinoma of no special type | 8 |
| Tumor-1 E2 | Tumor-1 | E2 | TRUE | Epithelial | Malignant | 41 | invasive carcinoma of no special type | 8 |
| Tumor-1 E2+DHT | Tumor-1 | E2+DHT | TRUE | Epithelial | Malignant | 41 | invasive carcinoma of no special type | 8 |
| Tumor-1 Veh | Tumor-1 | Veh | TRUE | Epithelial | Malignant | 41 | invasive carcinoma of no special type | 8 |
| Tumor-2 DHT | Tumor-2 | DHT | TRUE | Stroma;Epithelial | Malignant | 78 | infiltrating carcinoma of no special type | 8 |
| Tumor-2 E2 | Tumor-2 | E2 | TRUE | Stroma;Epithelial | Malignant | 78 | infiltrating carcinoma of no special type | 8 |
| Tumor-2 E2+DHT | Tumor-2 | E2+DHT | TRUE | Stroma;Epithelial | Malignant | 78 | infiltrating carcinoma of no special type | 8 |
| Tumor-2 Veh | Tumor-2 | Veh | TRUE | Stroma;Epithelial | Malignant | 78 | infiltrating carcinoma of no special type | 8 |
| Tumor-3 DHT | Tumor-3 | DHT | TRUE | Epithelial | Malignant | 65 | infiltrating carcinoma of no special type | 8 |
| Tumor-3 E2 | Tumor-3 | E2 | TRUE | Epithelial | Malignant | 65 | infiltrating carcinoma of no special type | 8 |
| Tumor-3 E2+DHT | Tumor-3 | E2+DHT | TRUE | Epithelial | Malignant | 65 | infiltrating carcinoma of no special type | 8 |
| Tumor-3 Veh | Tumor-3 | Veh | TRUE | Epithelial | Malignant | 65 | infiltrating carcinoma of no special type | 8 |
| Tumor-4 DHT | Tumor-4 | DHT | TRUE | Epithelial | Malignant | 76 | invasive carcinoma of no special type | 8 |
| Tumor-4 E2 | Tumor-4 | E2 | TRUE | Epithelial | Malignant | 76 | invasive carcinoma of no special type | 8 |
| Tumor-4 E2+DHT | Tumor-4 | E2+DHT | TRUE | Epithelial | Malignant | 76 | invasive carcinoma of no special type | 8 |
| Tumor-4 Veh | Tumor-4 | Veh | TRUE | Epithelial | Malignant | 76 | invasive carcinoma of no special type | 8 |
| Tumor-5 DHT | Tumor-5 | DHT | TRUE | Stroma | Malignant | 48 | infiltrating carcinoma of no special type | 8 |
| Tumor-5 E2 | Tumor-5 | E2 | TRUE | Stroma | Malignant | 48 | infiltrating carcinoma of no special type | 8 |
| Tumor-5 E2+DHT | Tumor-5 | E2+DHT | TRUE | Stroma | Malignant | 48 | infiltrating carcinoma of no special type | 8 |
| Tumor-5 Veh | Tumor-5 | Veh | TRUE | Stroma | Malignant | 48 | infiltrating carcinoma of no special type | 8 |
| Tumor-6 DHT | Tumor-6 | DHT | TRUE | Epithelial | Malignant | 43 | invasive carcinoma of no special type | 8 |
| Tumor-6 E2 | Tumor-6 | E2 | TRUE | Epithelial | Malignant | 43 | invasive carcinoma of no special type | 8 |
| Tumor-6 E2+DHT | Tumor-6 | E2+DHT | TRUE | Epithelial | Malignant | 43 | invasive carcinoma of no special type | 8 |
| Tumor-6 Veh | Tumor-6 | Veh | TRUE | Epithelial | Malignant | 43 | invasive carcinoma of no special type | 8 |
| Tumor-7 DHT | Tumor-7 | DHT | TRUE | Stroma | Malignant | 53 | invasive carcinoma of no special type | 8 |
| Tumor-7 E2 | Tumor-7 | E2 | TRUE | Stroma | Malignant | 53 | invasive carcinoma of no special type | 8 |
| Tumor-7 E2+DHT | Tumor-7 | E2+DHT | TRUE | Stroma | Malignant | 53 | invasive carcinoma of no special type | 8 |
| Tumor-7 Veh | Tumor-7 | Veh | TRUE | Stroma | Malignant | 53 | invasive carcinoma of no special type | 8 |
| Tumor-8 E2 | Tumor-8 | E2 | TRUE | Epithelial | Malignant | 76 | invasive carcinoma of no special type | 8 |
| Tumor-8 E2+DHT | Tumor-8 | E2+DHT | TRUE | Epithelial | Malignant | 76 | invasive carcinoma of no special type | 8 |
| Tumor-8 Veh | Tumor-8 | Veh | TRUE | Epithelial | Malignant | 76 | invasive carcinoma of no special type | 8 |
This data set is slightly unique in that samples were sequenced across multiple lanes and flowcells. Metadata was re-organised to enable summarising of counts across the multiple sequencing runs obtained for every sample. The initial PCA showed that variability between replicates and flowcells was dwarfed by variability between samples and whether it was tumor. As such, all counts were able to be merged to give a single set of counts for each individual sample.
treat_cols <- c(
Veh = rgb(0.7, 0.7, 0.7),
DHT = rgb(0.8, 0.2, 0.2),
E2 = rgb(0.2, 0.2, 0.8),
`E2+DHT` = rgb(1, 0.4, 1)
)
patient_cols <- hcl.colors(
n = length(levels(tumor_sample$patient)),
palette = "Spectral"
) %>%
setNames(levels(tumor_sample$patient))
age_cols <- hcl.colors(
n = 3,
palette = "Zissou 1"
) %>%
setNames(c("<30", "<50", ">=50"))
treat_shapes <- c(
Veh = 1,
DHT = 19,
E2 = 15,
`E2+DHT` = 17
)
# counts <- here::here("data/aligned/counts/counts.out.gz") %>%
# gzfile() %>%
# read_tsv(comment = "#") %>%
# dplyr::select(Geneid, ends_with("bam")) %>%
# rename_at(vars(ends_with("bam")), dirname) %>%
# rename_all(basename) %>%
# column_to_rownames("Geneid")
# mergedCounts <- counts %>%
# rownames_to_column("gene_id") %>%
# pivot_longer(
# cols = -gene_id,
# names_to = "Filename",
# values_to = "counts"
# ) %>%
# left_join(samples, by = "Filename") %>%
# group_by(
# gene_id, name, patient, treat, Tumor) %>%
# summarise(counts = sum(counts), .groups = "drop") %>%
# pivot_wider(
# id_cols = gene_id,
# values_from = counts,
# names_from = name
# ) %>%
# column_to_rownames("gene_id")
# saveRDS(mergedCounts, here::here("data/mergedCounts.rds"))
mergedCounts <- readRDS(here::here("data/mergedCounts.rds"))
rm(counts)
rm(samples)
gc()
used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 10956201 585.2 16488054 880.6 NA 16488054 880.6
Vcells 26362090 201.2 38677362 295.1 204800 32164469 245.4
minCPM <- 1.5
minSamples_tumor <- 8
genes2Keep_tumor <- mergedCounts[,str_subset(colnames(mergedCounts), "Tumor")] %>%
edgeR::cpm() %>%
is_greater_than(minCPM) %>%
rowSums() %>%
is_weakly_greater_than(minSamples_tumor)
\(>\) 1.5 counts per million (CPM) were required to observed in \(\geq\) 8 samples among tumor samples for a gene to be considered as detected.
Of the 60,671 genes contained in the annotation for this release, 42,234 genes were removed as failing this criteria for detection among tumor tissues, leaving 18,437 genes for downstream analysis.
a2 <- mergedCounts[,str_subset(colnames(mergedCounts), "Tumor")] %>%
edgeR::cpm(log = TRUE) %>%
as.data.frame() %>%
rownames_to_column("gene_id") %>%
as_tibble() %>%
pivot_longer(
cols = contains("Tumor"),
names_to = "name",
values_to = "logCPM"
) %>%
left_join(tumor_sample) %>%
ggplot(aes(logCPM, stat(density), colour = patient, linetype = treat)) +
geom_density() +
scale_colour_manual(values= patient_cols[str_subset(names(patient_cols), "Tumor")]) +
labs(
y = "Density",
colour = "Patient",
linetype = "Treatment"
)
b2 <- mergedCounts[genes2Keep_tumor,str_subset(colnames(mergedCounts), "Tumor")] %>%
edgeR::cpm(log = TRUE) %>%
as.data.frame() %>%
rownames_to_column("gene_id") %>%
as_tibble() %>%
pivot_longer(
cols = contains("Tumor"),
names_to = "name",
values_to = "logCPM"
) %>%
left_join(tumor_sample) %>%
ggplot(aes(logCPM, stat(density), colour = patient, linetype = treat)) +
geom_density() +
scale_colour_manual(values= patient_cols[str_subset(names(patient_cols), "Tumor")]) +
labs(
y = "Density",
colour = "Patient",
linetype = "Treatment"
)
plot_grid(
a2 + theme(legend.position = "none"),
b2,
labels = c("A", "B"),
ncol = 2,
rel_widths = c(0.4, 0.6))
Distributions of logCPM values on merged counts, A) before and B) after filtering of undetectable genes for the 31 tumor sampels. Some differences between patients were noted.
A dgeListwas formed for the tumour samples.
dge_tumor <- mergedCounts[genes2Keep_tumor,colnames(mergedCounts) %in% tumor_sample$name] %>%
DGEList(
samples = tumor_sample %>%
as.data.frame() %>%
set_rownames(.$name) %>%
.[,colnames(.)],
group = tumor_sample %>%
as.data.frame() %>%
set_rownames(.$name) %>%
.[,colnames(.)] %>%
pull(treat),
genes = genesGR[rownames(.)]
) %>%
calcNormFactors()
dge_tumor$samples %>%
ggplot(aes(treat, lib.size, fill = treat)) +
geom_col() +
geom_hline(yintercept = 1e7, linetype = 2) +
facet_wrap(~patient, ncol = 5) +
scale_y_continuous(
labels = comma, expand = expansion(c(0, 0.05))
) +
scale_fill_manual(values= treat_cols) +
labs(x = "Sample Name", y = "Library Size", fill = "Treatment") +
theme(legend.position = c(11/12, 1/6))
Library sizes of all tumour samples after removal of undetectable genes. The common-use minimum library size of 10 million reads is shown as a dashed line.
Given that previous work had only assessed the libraries prior to merging, the library sizes were checked after merging patient sequencing runs. The median library size was found to be 12.9 million reads for tumor samples which are just above the common-use minimum recommendation of 10 million reads/sample.
A PCA was performed on all the tumour samples.
pca_tumor <- dge_tumor %>%
edgeR::cpm(log = TRUE) %>%
t() %>%
prcomp()
No common direction based on treatment appears evident with each patient, and great inter-patient heterogenity was observed even among the control samples.
g1 <- pca_tumor %>%
autoplot(
data = dge_tumor$samples,
x = 1, y = 2,
colour = "treat",
shape = "treat",
size = 3
) +
facet_wrap(~patient, ncol = 4) +
scale_colour_manual(values= treat_cols) +
scale_shape_manual(values = treat_shapes) +
labs(
colour = "Treatment",
shape = "Treatment"
)
g2 <- pca_tumor %>%
autoplot(
data = dge_tumor$samples,
x = 1, y = 2,
colour = "treat",
shape = "treat",
size = 3
) +
scale_colour_manual(values= treat_cols) +
scale_shape_manual(values = treat_shapes) +
labs(
colour = "Treatment",
shape = "Treatment"
)
plot_grid(g1, g2,
ncol = 2,
rel_widths = c(1.8,1),
scale = c(1, 0.8),
labels = c("(a)", "(b)"))
PCA on logCPM from merged counts for (a) each of, (b) all of the 8 tumor patients. No common direction based on treatment appears evident with each patient, and great inter-patient heterogenity was observed even among the control samples.
The impact of GC content or length bias was assessed as a possible source.
Genes were divided in 10 approximately equal sized bins based on increasing length, and 10 approximately equal sized bins based on increasing GC content, with the final GC/Length bins being the combination 100 bins using both sets. The contribution of each gene to PC1 and PC2 was assessed and a t-test performed on each bin. This tests
\[ H_0: \mu = 0 \text{ against } H_A: \mu \neq 0 \]
where \(\mu\) represents the true contribution to PC1 of all genes in that bin.
If any bin makes a contribution to PC1 the mean will be clearly non-zero, whilst if there is no contribution the mean will be near zero. In this way, the impact of gene length and GC content on variance within the dataset can be assessed.
As seen below, the contribution of GC content and gene length to PC1
is very clear in tumor samples, with a smaller contribution being
evident across PC1. As a result, Conditional Quantile Normalisation
(CQN) is recommended in preference to the more common TMM
normalisation.
dge_tumor$genes %>%
dplyr::select(gene_id, ave_tx_len, gc_content) %>%
mutate(
GC = cut(
x = gc_content,
labels = seq_len(10),
breaks = quantile(gc_content, probs = seq(0, 1, length.out = 11)),
include.lowest = TRUE
),
Length = cut(
x = ave_tx_len,
labels = seq_len(10),
breaks = quantile(ave_tx_len, probs = seq(0, 1, length.out = 11)),
include.lowest = TRUE
),
bin = paste(GC, Length, sep = "_"),
PC1 = pca_tumor$rotation[gene_id, "PC1"],
PC2 = pca_tumor$rotation[gene_id, "PC2"]
) %>%
pivot_longer(
cols = c("PC1", "PC2"),
names_to = "PC",
values_to = "value"
) %>%
group_by(PC, GC, Length, bin) %>%
summarise(
Size = n(),
mean = mean(value),
sd = sd(value),
t = t.test(value)$statistic,
p = t.test(value)$p.value,
adjP = p.adjust(p, method = "bonf")
) %>%
ggplot(
aes(Length, GC, colour = t, alpha = -log10(adjP), size = Size)
) +
geom_point() +
facet_wrap(~PC) +
scale_colour_gradient2() +
scale_size_continuous(range = c(1, 10)) +
labs(alpha = expression(paste(-log[10], p))) +
theme(
panel.grid = element_blank(),
legend.position = "bottom"
)
Contribution of each GC/Length Bin to PC1 and PC2 among tumor samples. Fill colours indicate the t-statistic, with tranparency denoting significance as -log10(p), using Bonferroni-adjusted p-values. The number of genes in each bin is indicated by the circle size. The clear pattern across PC1 is unambiguous.
cqNorma_tumor <- with(
dge_tumor,
cqn(
counts= counts,
x = genes$gc_content,
lengths = genes$ave_tx_len
)
)
dge_tumor$offset <- cqNorma_tumor$glm.offset
logCPM_tumor <- cqNorma_tumor$y + cqNorma_tumor$offset
# saveRDS(dge_tumor, here::here("output/dge_tumor.rds"))
# saveRDS(logCPM_tumor, here::here("output/logCPM_tumor.rds"))
a2 <- cqNorma_tumor$func1 %>%
as.data.frame() %>%
mutate(x = cqNorma_tumor$grid1) %>%
pivot_longer(
cols = any_of(colnames(dge_tumor)),
names_to = "name",
values_to = "QR fit"
) %>%
left_join(dge_tumor$samples) %>%
ggplot(
aes(x, `QR fit`, colour = patient, group = name, linetype = treat)
) +
geom_line() +
labs(x = "GC content", colour = "Patient", linetype = "Treatment")
b2 <- cqNorma_tumor$func2 %>%
as.data.frame() %>%
mutate(x = cqNorma_tumor$grid2) %>%
pivot_longer(
cols = any_of(colnames(dge_tumor)),
names_to = "name",
values_to = "QR fit"
) %>%
left_join(dge_tumor$samples) %>%
ggplot(
aes(x, `QR fit`, colour = patient, group = name, linetype = treat)
) +
geom_line() +
labs(
x = expression(paste(log[10], " Gene Length (kb)")),
colour = "Patient", linetype = "Treatment"
)
plot_grid(
a2 + theme(legend.position = "none"),
b2 + theme(legend.position = "none"),
nrow = 1)
Model fits used when applying CQN in all tumor samples. The divergent samples previously noted on the PCA are again quite divergent here. In particular, the long genes with high GC content appear to be where the primary differences are found, in keeping with the previous PCA analysis.
pcaPost_tumor <- logCPM_tumor %>%
t() %>%
prcomp()
The strong inter-patient heterogeneity is sitll evident even post-normalization.
g1 <- pcaPost_tumor %>%
autoplot(
data = dge_tumor$samples,
x = 1, y = 2,
colour = "treat",
shape = "treat",
size = 3
) +
facet_wrap(~patient, ncol = 4) +
scale_colour_manual(values= treat_cols) +
scale_shape_manual(values = treat_shapes) +
labs(
colour = "Treatment",
shape = "Treatment"
)
g2 <- pcaPost_tumor %>%
autoplot(
data = dge_tumor$samples,
x = 1, y = 2,
colour = "treat",
shape = "treat",
size = 3
) +
scale_colour_manual(values= treat_cols) +
scale_shape_manual(values = treat_shapes) +
labs(
colour = "Treatment",
shape = "Treatment"
)
# pdf(file = here::here("figure/pcaPostNor_tumor.pdf"),
# width = 12,
# height = 5)
plot_grid(g1, g2,
ncol = 2,
rel_widths = c(1.3,1),
scale = c(0.9, 0.8),
labels = c("(a)", "(b)"))
PCA on logCPM after performing CQN for (a) each of, (b) all of the 8 tumor patients. No common direction based on treatment appears evident with each patient. Great inter-patient heterogenity still exists even post-normalisation.
# dev.off()
dge_tumor$genes %>%
dplyr::select(gene_id, ave_tx_len, gc_content) %>%
mutate(
GC = cut(
x = gc_content,
labels = seq_len(10),
breaks = quantile(gc_content, probs = seq(0, 1, length.out = 11)),
include.lowest = TRUE
),
Length = cut(
x = ave_tx_len,
labels = seq_len(10),
breaks = quantile(ave_tx_len, probs = seq(0, 1, length.out = 11)),
include.lowest = TRUE
),
bin = paste(GC, Length, sep = "_"),
`pre-CQN` = pca_tumor$rotation[gene_id, "PC1"],
`post-CQN` = pcaPost_tumor$rotation[gene_id, "PC1"]
) %>%
pivot_longer(
cols = c("pre-CQN", "post-CQN"),
names_to = "status",
values_to = "value"
) %>%
mutate(status = factor(status, levels = c("pre-CQN", "post-CQN"))) %>%
group_by(status, GC, Length, bin) %>%
summarise(
Size = n(),
mean = mean(value),
sd = sd(value),
t = t.test(value)$statistic,
p = t.test(value)$p.value,
FDR = p.adjust(p, method = "bonf")
) %>%
ggplot(
aes(Length, GC, colour = t, alpha = -log10(FDR), size = Size)
) +
geom_point() +
facet_wrap(~status) +
scale_colour_gradient2() +
scale_size_continuous(range = c(1, 10)) +
theme(
panel.grid = element_blank(),
legend.position = "bottom"
)
Contributions of gene length and GC content to PC1 shown both before and after CQN among tumor samples.. The strong patterns seen before normalisation have clearly been reduced. Fill colours indicate the t-statistic, with tranparency denoting significance as -log10(p), using Bonferroni-adjusted p-values.
Because of the high inter-patient heterogeneity observed, correlation between the first three PCA components and sample metadata was tested.
As shown in the figure below, pathologist’s labels on the tumours strongly correlated with the first three PCA components.
pcaPost_tumor$x %>%
as.data.frame() %>%
rownames_to_column("name") %>%
left_join(tumor_sample) %>%
dplyr::select(PC1, PC2, PC3, patient, treat, Diagnosis) %>%
dplyr::rename(
Treatment = treat,
Patient = patient
) %>%
mutate_all(as.numeric) %>%
cor() %>%
corrplot(
type = "lower",
diag = FALSE,
addCoef.col = 1,
col = rev(COL2('RdBu', 200)),
addCoefasPercent = TRUE
)
Correlations between the first three principal components and sample metadata in tumor samples. Treatment, patient, and pathologist diagnosis were converted to an ordered categorical variable for the purposes of visualisation.
formatP <- function(p, m = 0.0001){
out <- rep("", length(p))
out[p < m] <- sprintf("%.2e", p[p<m])
out[p >= m] <- sprintf("%.4f", p[p>=m])
out
}
Average treatment responses among tumor samples were tested through
fitting quasi-likelihood negative binomial GLMs through
edgeR.
X_tumor <- model.matrix(~ 0 + patient + treat, data = dge_tumor$samples %>%
droplevels()) %>%
set_colnames(str_remove_all(colnames(.), "treat"))
dge_tumor <- estimateDisp(dge_tumor, design = X_tumor, robust = TRUE)
fit_tumor <- glmQLFit(dge_tumor)
colnames(X_tumor) <- make.names(colnames(X_tumor))
contrast_tumor <- makeContrasts(DHT = DHT,
E2 = E2,
`E2+DHT` = E2.DHT,
`E2+DHTvsDHT` = E2.DHT - DHT,
`E2+DHTvsE2` = E2.DHT - E2,
levels = X_tumor)
alpha <- 0.05
topTables_tumor <- colnames(contrast_tumor) %>%
str_subset("patient", TRUE) %>%
sapply(function(x){
glmQLFTest(fit_tumor, contrast = contrast_tumor[,x]) %>%
topTags(n = Inf) %>%
.[["table"]] %>%
as_tibble() %>%
mutate(
coef = x,
location = paste0(seqnames, ":", start, "-", end, ":", strand),
rankingStat = -sign(logFC)*log10(PValue),
signedRank = rank(rankingStat),
DE = FDR < alpha
) %>%
dplyr::select(
gene_id, gene_name, logCPM, logFC, PValue, FDR, coef,
location, gene_biotype, entrezid, ave_tx_len, gc_content,
rankingStat, signedRank, DE
)
},
simplify = FALSE)
# saveRDS(topTables_tumor, here::here("output/topTables_tumor.rds"))
No DEG was detected in any of the treatment or contrast group:
topTables_tumor %>%
lapply(dplyr::filter, FDR < alpha) %>%
vapply(nrow, integer(1)) %>%
set_names(c("DHT vs Veh", "E2 vs Veh","E2+DHT vs Veh", "E2+DHT vs DHT", "E2+DHT vs E2" )) %>%
pander()
| DHT vs Veh | E2 vs Veh | E2+DHT vs Veh | E2+DHT vs DHT | E2+DHT vs E2 |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
The full DE analysis results are available in the table below:
topTables_tumor %>%
# lapply(dplyr::filter, FDR < alpha) %>%
lapply(dplyr::select, gene_name, logFC, FDR, coef, DE) %>%
bind_rows() %>%
mutate_if(is.numeric, signif, 2) %>%
mutate_at(vars(contains(c("gene_name", "coef"))), as.factor) %>%
mutate(Direction = ifelse(logFC < 0, "Down", "Up")) %>%
dplyr::rename(`Gene name` = gene_name,
`Comparision` = coef) %>%
datatable(
filter = "top",
options = list(scrollY = '5500px')) %>%
formatStyle(
c("logFC", "Direction"),
color = styleEqual(c("Down", "Up"), c('blue', 'red'))
)
Code used to generate figures for the thesis.
tumor_sample <- tumor_sample %>%
.[match(rownames(pcaPost_tumor$x), .$name),]
pcaPost_tumor_df <- pcaPost_tumor$x %>%
as.data.frame() %>%
rownames_to_column("name") %>%
left_join(
tumor_sample
)
thesis_1b <- pcaPost_tumor_df %>%
ggplot(
aes(PC1, PC2)
) +
geom_point(aes( color = patient, shape = treat),
size = 4) +
scale_shape_manual(values = treat_shapes) +
labs(
colour = "Patient",
fill = "Patient",
shape = "Treatment"
) +
geom_mark_ellipse(
aes(
label = patient,
fill = patient
),
label.fontsize = 24
) +
guides(color = FALSE, fill = FALSE) +
theme(
panel.grid = element_blank(),
text = element_text(size = 40)
)
# png("~/PhD_thesis/draft_figure/chapter_04/Figure1B.png",
# width = 1200,
# height = 600)
# thesis_1b
# dev.off()
# png("~/PhD_thesis/draft_figure/chapter_04/Figure1C.png",
# width = 1000,
# height = 1000)
# pcaPost_tumor$x %>%
# as.data.frame() %>%
# rownames_to_column("name") %>%
# left_join(tumor_sample) %>%
# dplyr::select(PC1, PC2, PC3, patient, treat, Age, Diagnosis) %>%
# dplyr::rename(
# Treatment = treat,
# Patient = patient,
# `Pathology Diagnosis` = Diagnosis
# ) %>%
# mutate(Age = as.numeric(as.character(Age)),
# across(-"Age", as.numeric)) %>%
# cor() %>%
# corrplot(
# type = "lower",
# diag = FALSE, addCoef.col = "black",
# col = rev(COL2('RdBu', 200)),
# addCoefasPercent = TRUE,
# number.cex = 2,
# tl.cex = 2,
# tl.col = "black",
# cl.cex = 2
# )
# dev.off()
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Australia/Adelaide
tzcode source: internal
attached base packages:
[1] splines stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] ggforce_0.4.1 reactable_0.4.4 WGCNA_1.72-1
[4] fastcluster_1.2.3 dynamicTreeCut_1.63-1 goseq_1.52.0
[7] geneLenDataBase_1.36.0 BiasedUrn_2.0.10 broom_1.0.5
[10] msigdbr_7.5.1 singscore_1.20.0 corrplot_0.92
[13] UpSetR_1.4.0 BiocParallel_1.34.2 DT_0.28
[16] ggrepel_0.9.3 cqn_1.46.0 quantreg_5.95
[19] SparseM_1.81 preprocessCore_1.62.1 nor1mix_1.3-0
[22] mclust_6.0.0 magrittr_2.0.3 ggfortify_0.4.16
[25] cowplot_1.1.1 ensembldb_2.24.0 AnnotationFilter_1.24.0
[28] GenomicFeatures_1.52.1 AnnotationDbi_1.62.2 Biobase_2.60.0
[31] GenomicRanges_1.52.0 GenomeInfoDb_1.36.1 IRanges_2.34.1
[34] S4Vectors_0.38.1 AnnotationHub_3.8.0 BiocFileCache_2.8.0
[37] dbplyr_2.3.3 BiocGenerics_0.46.0 edgeR_3.42.4
[40] limma_3.56.2 glue_1.6.2 pander_0.6.5
[43] scales_1.2.1 yaml_2.3.7 lubridate_1.9.2
[46] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2
[49] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0
[52] tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] later_1.3.1 BiocIO_1.10.0
[3] bitops_1.0-7 filelock_1.0.2
[5] polyclip_1.10-4 graph_1.78.0
[7] rpart_4.1.19 XML_3.99-0.14
[9] lifecycle_1.0.3 doParallel_1.0.17
[11] rprojroot_2.0.3 vroom_1.6.3
[13] lattice_0.21-8 MASS_7.3-60
[15] crosstalk_1.2.0 backports_1.4.1
[17] Hmisc_5.1-0 sass_0.4.6
[19] rmarkdown_2.23 jquerylib_0.1.4
[21] httpuv_1.6.11 DBI_1.1.3
[23] zlibbioc_1.46.0 RCurl_1.98-1.12
[25] nnet_7.3-19 tweenr_2.0.2
[27] rappdirs_0.3.3 git2r_0.32.0
[29] GenomeInfoDbData_1.2.10 MatrixModels_0.5-2
[31] annotate_1.78.0 codetools_0.2-19
[33] DelayedArray_0.26.6 xml2_1.3.5
[35] tidyselect_1.2.0 farver_2.1.1
[37] base64enc_0.1-3 matrixStats_1.0.0
[39] GenomicAlignments_1.36.0 jsonlite_1.8.7
[41] Formula_1.2-5 ellipsis_0.3.2
[43] iterators_1.0.14 survival_3.5-5
[45] foreach_1.5.2 tools_4.3.0
[47] progress_1.2.2 Rcpp_1.0.11
[49] gridExtra_2.3 here_1.0.1
[51] xfun_0.39 mgcv_1.9-0
[53] MatrixGenerics_1.12.2 withr_2.5.0
[55] BiocManager_1.30.21 fastmap_1.1.1
[57] fansi_1.0.4 digest_0.6.33
[59] timechange_0.2.0 R6_2.5.1
[61] mime_0.12 colorspace_2.1-0
[63] GO.db_3.17.0 biomaRt_2.56.1
[65] RSQLite_2.3.1 utf8_1.2.3
[67] generics_0.1.3 data.table_1.14.8
[69] rtracklayer_1.60.0 prettyunits_1.1.1
[71] httr_1.4.6 htmlwidgets_1.6.2
[73] S4Arrays_1.0.4 pkgconfig_2.0.3
[75] gtable_0.3.3 blob_1.2.4
[77] impute_1.74.1 workflowr_1.7.0
[79] XVector_0.40.0 htmltools_0.5.5
[81] GSEABase_1.62.0 ProtGenerics_1.32.0
[83] png_0.1-8 knitr_1.43
[85] rstudioapi_0.15.0 tzdb_0.4.0
[87] reshape2_1.4.4 rjson_0.2.21
[89] checkmate_2.2.0 nlme_3.1-162
[91] curl_5.0.1 cachem_1.0.8
[93] BiocVersion_3.17.1 parallel_4.3.0
[95] foreign_0.8-84 restfulr_0.0.15
[97] pillar_1.9.0 grid_4.3.0
[99] vctrs_0.6.3 promises_1.2.0.1
[101] cluster_2.1.4 xtable_1.8-4
[103] htmlTable_2.4.1 evaluate_0.21
[105] cli_3.6.1 locfit_1.5-9.8
[107] compiler_4.3.0 Rsamtools_2.16.0
[109] rlang_1.1.1 crayon_1.5.2
[111] labeling_0.4.2 plyr_1.8.8
[113] fs_1.6.2 stringi_1.7.12
[115] babelgene_22.9 munsell_0.5.0
[117] Biostrings_2.68.1 lazyeval_0.2.2
[119] Matrix_1.6-0 hms_1.1.3
[121] bit64_4.0.5 statmod_1.5.0
[123] KEGGREST_1.40.0 shiny_1.7.4.1
[125] highr_0.10 SummarizedExperiment_1.30.2
[127] interactiveDisplayBase_1.38.0 memoise_2.0.1
[129] bslib_0.5.0 bit_4.0.5